Abstract: The Discrete Cosine Transform (DCT) is one of the major
building blocks of image and video compression systems and can be
computed using various specialized algorithms. It is used in
several standardized coding schemes, such as JPEG and MPEG-2.
Computing the DCT directly requires a large number of
multiplications, which are time-consuming and add delay to the
arithmetic operations. These multiplications can be avoided
entirely by using the Distributed Arithmetic (DA) approach, on
which the proposed architecture is based. The shifting and
addition in the proposed method use fixed-point arithmetic, which
gives more precise results than existing architectures.
Implemented on a Virtex-5 FPGA, the design has a delay of 17 ns
and uses only 2% of the available LUTs.

Index Terms: DCT, DA, FPGA.


I. INTRODUCTION

The Discrete Cosine Transform (DCT) is the most popular transform
technique used in image compression, owing to its energy
compaction and low complexity. It has become an integral part of
compression standards such as JPEG, MPEG-2, MPEG-4 and H.261. A
number of “fast” algorithms have been developed for discrete
transform computation [1],[2],[3]; however, the continuous demand
for high-speed, high-throughput and low-latency architectures
puts tremendous pressure on VLSI designers. The DCT is the core
of the image compression algorithm and is calculated on
comparatively small square matrices. The DCT block receives an
NxN image, which is divided into smaller blocks (4x4, 8x8, 16x16,
etc.), and each block is transformed from the spatial domain to
the frequency domain [4].


Direct implementation of the DCT requires a large number of
adders and multipliers. Distributed arithmetic (DA) is a good way
to implement multiplication without a multiplier (multipliers
consume more power), and implementing the DCT using DA offers
several advantages in terms of area and speed. DA is an efficient
method for calculating an inner product when one of the input
vectors is fixed [1], which makes it well suited to computing the
multiply and accumulate (MAC) function. It uses precomputed
look-up tables and accumulators instead of multipliers for
calculating inner products and has been widely used in many DSP
applications such as the DFT, DCT, convolution and digital
filters [9]. As the DCT is based on evaluating a sum of products
(SOP), or MAC, DA can reduce each multiplication to simple shift
and add operations.

In this paper the periodicity and symmetry of the DCT are
exploited to reduce computational redundancy and thereby improve
performance. Based on the NEDA (New Distributed Arithmetic)
architecture [3], an architecture is proposed that computes the
DCT more precisely by using fixed-point calculation for the
shifting and adding operations. Moreover, the integer and decimal
parts of the multiplication are handled separately to obtain more
precise results.

The rest of the paper is organized as follows. Section II deals
with the realization of the DCT using DA, and Section III
presents the proposed work and results for the 1-D DCT.
Section IV concludes the work.


II. DCT USING DISTRIBUTED ARITHMETIC 


The DCT transforms information from the time or space domain to
the frequency domain. It expresses a finite number of data points
as a sum of cosine functions that oscillate at different
frequencies [6]. An 8x8-point 2-D DCT can be expressed by Eq. 1.

$$Y(k,m) = \frac{1}{4}\,\alpha(k)\,\alpha(m) \sum_{i=0}^{7}\sum_{j=0}^{7} x(i,j)\,\cos\frac{(2i+1)k\pi}{16}\,\cos\frac{(2j+1)m\pi}{16} \tag{1}$$

Here:

Y(k, m) and x(i, j) represent the transformed output and the
two-dimensional input sequence respectively. Also k, m, i, j =
0, 1, ..., 7, and $\alpha(k)$, $\alpha(m)$ are defined as:

$$\alpha(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & \text{otherwise} \end{cases}$$

Eq. (1) can be expressed in matrix-vector form.

$$[Y] = [C_{8 \times 8}] \cdot [X] \tag{2}$$

Here $C_{8 \times 8}$ is the DCT coefficient matrix.

The 2-D basis functions can be generated by multiplying the
horizontally oriented 1-D basis functions with a vertically
oriented set of the same functions [6]. From the above
representation, the 2-D DCT can be decomposed into two 1-D DCTs,
with a transpose between two identical 1-D DCT modules. The
architecture basically consists of one 1-D row DCT, a transpose
operation and one more 1-D column DCT. Based on this, Eq. 2 can
be further expressed using a matrix transpose and two 1-D DCTs as
Eq. 3.

$$[Y] = [C] \cdot [X] \cdot [C^{T}] \tag{3}$$
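The row-column decomposition of Eq. 3 can be checked numerically. The following is a minimal sketch in plain Python, assuming the orthonormal DCT-II scaling (so that the 8x8 case carries the factor 1/4 of Eq. 1); the function names and the sample block `BLOCK` are illustrative and not part of the original design.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II coefficient matrix C (assumed scaling)."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def dct1d(vec, c):
    """1-D DCT of one row: matrix-vector product C * vec."""
    return [sum(c[k][i] * vec[i] for i in range(len(vec))) for k in range(len(c))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct2d_row_column(block):
    """1-D DCT on every row, transpose, 1-D DCT again, transpose back."""
    c = dct_matrix(len(block))
    rows_done = [dct1d(row, c) for row in block]
    cols_done = [dct1d(row, c) for row in transpose(rows_done)]
    return transpose(cols_done)

def dct2d_direct(block):
    """Double-sum form of the 2-D DCT, evaluated term by term."""
    n = len(block)
    alpha = lambda k: 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    out = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for m in range(n):
            acc = 0.0
            for i in range(n):
                for j in range(n):
                    acc += (block[i][j]
                            * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                            * math.cos((2 * j + 1) * m * math.pi / (2 * n)))
            out[k][m] = (2.0 / n) * alpha(k) * alpha(m) * acc
    return out

# Arbitrary deterministic 8x8 test block (illustrative data only).
BLOCK = [[(8 * i + 3 * j) % 17 for j in range(8)] for i in range(8)]
```

Calling `dct2d_row_column(BLOCK)` and `dct2d_direct(BLOCK)` should give the same 8x8 result up to floating-point rounding. The final transpose in the sketch only restores the (k, m) ordering; a hardware pipeline could equally leave the output transposed.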

After row-column decomposition, the 8-point 1-D DCT is applied to
each row of the input matrix; each 8x8 block of "semi-transformed"
values is then transposed and a further 1-D DCT is applied to it
[7]. The 8-point 1-D DCT in matrix representation [7] is shown by
Eq. 4.

From the above matrix it can be concluded that $C_x = \cos(x\pi/16)$.



Direct implementation of Eq. 4 requires 64 multiplications and
56 additions. Such a solution is not hardware efficient, as
multiplication requires more power and silicon area. Therefore,
by applying the periodicity and symmetry properties of the DCT
coefficient matrix, Eq. 4 can be rewritten [7] as Eq. 5.

By exploiting the coefficient matrix in the form discussed above,
the multiplications can be reduced to 32, at the expense of a
further 8 additions/subtractions [7].

$$
\begin{aligned}
Y(0) &= [x(0) + x(1) + x(2) + x(3) + x(4) + x(5) + x(6) + x(7)]\,C_4 \\
Y(1) &= [x(0) - x(7)]\,C_1 + [x(1) - x(6)]\,C_3 + [x(2) - x(5)]\,C_5 + [x(3) - x(4)]\,C_7 \\
Y(2) &= [x(0) - x(3) - x(4) + x(7)]\,C_2 + [x(1) - x(2) - x(5) + x(6)]\,C_6 \\
Y(3) &= [x(0) - x(7)]\,C_3 - [x(1) - x(6)]\,C_7 - [x(2) - x(5)]\,C_1 - [x(3) - x(4)]\,C_5 \\
Y(4) &= [x(0) + x(7) - x(1) - x(6) - x(2) - x(5) + x(3) + x(4)]\,C_4 \\
Y(5) &= [x(0) - x(7)]\,C_5 - [x(1) - x(6)]\,C_1 + [x(2) - x(5)]\,C_7 + [x(3) - x(4)]\,C_3 \\
Y(6) &= [x(0) + x(7) - x(3) - x(4)]\,C_6 + [x(2) + x(5) - x(1) - x(6)]\,C_2 \\
Y(7) &= [x(0) - x(7)]\,C_7 - [x(1) - x(6)]\,C_5 + [x(2) - x(5)]\,C_3 - [x(3) - x(4)]\,C_1
\end{aligned} \tag{6}
$$
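As a sanity check on Eq. 6, the symmetric form can be compared against a term-by-term 8-point DCT. This is a minimal sketch, assuming that the constants already include the 1/2 scale factor of the 8-point DCT, i.e. `C[k] = 0.5 * cos(k*pi/16)` (consistent with the value 0.49039 quoted for C1 later in the text); the function names are illustrative.

```python
import math

# Assumption: the 1/2 scale factor of the 8-point DCT is folded into C[k].
C = [0.5 * math.cos(k * math.pi / 16) for k in range(8)]

def dct8_direct(x):
    """Reference 8-point DCT-II evaluated term by term (64 multiplications)."""
    out = []
    for k in range(8):
        alpha = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        acc = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / 16) for i in range(8))
        out.append(0.5 * alpha * acc)
    return out

def dct8_symmetric(x):
    """8-point DCT using the sum/difference structure of Eq. 6."""
    s = [x[i] + x[7 - i] for i in range(4)]  # sums feed the even outputs
    d = [x[i] - x[7 - i] for i in range(4)]  # differences feed the odd outputs
    return [
        (s[0] + s[1] + s[2] + s[3]) * C[4],
        d[0] * C[1] + d[1] * C[3] + d[2] * C[5] + d[3] * C[7],
        (s[0] - s[3]) * C[2] + (s[1] - s[2]) * C[6],
        d[0] * C[3] - d[1] * C[7] - d[2] * C[1] - d[3] * C[5],
        (s[0] - s[1] - s[2] + s[3]) * C[4],
        d[0] * C[5] - d[1] * C[1] + d[2] * C[7] + d[3] * C[3],
        (s[0] - s[3]) * C[6] + (s[2] - s[1]) * C[2],
        d[0] * C[7] - d[1] * C[5] + d[2] * C[3] - d[3] * C[1],
    ]

# Input pixel values listed in Table I.
TABLE_INPUT = [60, 40, 25, 55, 40, 42, 82, 84]
```

For `TABLE_INPUT`, both functions agree with each other and with the MATLAB column of Table I to within 1e-3.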


On application of DA to Eq. 6, these 32 multiplication operations
can be replaced by addition and shifting operations.


DA is a very common method of implementing multiplication without
a multiplier. Traditional DA differs from NEDA primarily in two
ways:

1. In DA, input words are distributed into bits, making the DA
mechanism a bit-serial design, and

2. A ROM is introduced to store a lookup table obtained by
pre-computing the results for all possible combinations of input
bit patterns.
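The two properties of traditional DA listed above can be sketched in software. The following is a minimal illustration, assuming 8-bit two's complement inputs and a lookup table indexed by one bit from each input; `build_lut` and `da_inner_product` are hypothetical names, and the sample data are only for demonstration.

```python
def build_lut(coeffs):
    """ROM contents: for each address, the sum of coefficients whose address bit is 1."""
    k = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << k)]

def da_inner_product(coeffs, xs, bits=8):
    """Bit-serial inner product sum(coeffs[k] * xs[k]) for signed `bits`-bit inputs."""
    lut = build_lut(coeffs)
    acc = 0.0
    for n in range(bits):
        # Collect bit n of every input word to form the ROM address.
        addr = 0
        for j, x in enumerate(xs):
            addr |= ((x >> n) & 1) << j
        term = lut[addr] * (1 << n)
        # The sign bit of a two's complement word carries a negative weight.
        acc += -term if n == bits - 1 else term
    return acc

# Illustrative data: four constants and four signed 8-bit inputs.
COEFFS = [0.49039, 0.41573, 0.27779, 0.09755]
XS = [-24, -42, -17, 15]
```

The table has 2^K entries for K inputs, which is why the ROM grows quickly with the vector length. `da_inner_product(COEFFS, XS)` matches the ordinary sum of products up to floating-point rounding.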


Consider the following example of a sum of products for NEDA:


$$Y = \sum_{k=1}^{L} A_k X_k = [A_1\ A_2\ \cdots\ A_L]\,[X_1\ X_2\ \cdots\ X_L]^{T} \tag{7}$$


where $A_k$ is a constant and $X_k$ is the input data. $A_k$ can be
represented as:

From the resulting matrix one can conclude that, as the
multiplication matrix consists only of ‘0’ and ‘1’, the
computation reduces to just two operations: addition and
shifting.
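The idea that a 0/1 coefficient matrix leaves only additions and shifts can be illustrated as follows. This is a sketch under stated assumptions: each constant lies in [-1, 1) and is truncated to a two's complement word with one sign bit (weight -2^0) and 12 fraction bits; the helper names are hypothetical.

```python
import math

FRAC_BITS = 12  # assumed number of fraction bits per constant

def coefficient_bits(value, frac_bits=FRAC_BITS):
    """Bits b_0..b_frac of a value in [-1, 1); b_0 weighs -2^0, b_i weighs 2^-i."""
    word = math.floor(value * (1 << frac_bits)) & ((1 << (frac_bits + 1)) - 1)
    return [(word >> (frac_bits - i)) & 1 for i in range(frac_bits + 1)]

def neda_inner_product(coeffs, xs, frac_bits=FRAC_BITS):
    """Sum of products using only additions of inputs and power-of-two weights."""
    rows = [coefficient_bits(c, frac_bits) for c in coeffs]
    total = 0.0
    for i in range(frac_bits + 1):
        # Adder stage: add the inputs whose coefficient has a 1 in bit plane i.
        partial = sum(x for bits, x in zip(rows, xs) if bits[i])
        # Shift stage: weight the partial sum by 2^-i (the sign plane is negative).
        weighted = partial / (1 << i)
        total += -weighted if i == 0 else weighted
    return total

COEFFS = [0.5 * math.cos(k * math.pi / 16) for k in (1, 3, 5, 7)]
XS = [-24, -42, -17, 15]
```

The result equals the inner product with the truncated constants, so it differs from the exact value only by the coefficient quantization error. No lookup table is needed, in contrast to traditional DA.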

The constants $C_1, C_2, C_3, C_5, C_6, C_7$ can be assumed to have
the following values, as discussed before in Section II
($C_x = \cos(x\pi/16)$):

Y(0), Y(1), Y(2), ..., Y(7), as discussed in Section II, can be
calculated by exploiting the symmetry of the DCT together with
DA. This can be understood by taking some examples (Y(1) from
Eq. 6 is repeated here):

$$Y(1) = [x(0) - x(7)]\,C_1 + [x(1) - x(6)]\,C_3 + [x(2) - x(5)]\,C_5 + [x(3) - x(4)]\,C_7 \tag{10}$$

Eq. 10 can again be represented in vector-matrix form as shown
below, where the values of $C_1$, $C_3$, $C_5$, $C_7$ are taken from
Eq. 9.

$$Y(1) = [C_1\ C_3\ C_5\ C_7] \begin{bmatrix} x(0) - x(7) \\ x(1) - x(6) \\ x(2) - x(5) \\ x(3) - x(4) \end{bmatrix} \tag{11}$$


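The value of $C_1$ is 0.49039 (that is, $\tfrac{1}{2}\cos(\pi/16)$,
which includes the 1/2 scale factor of the 8-point DCT), and its
binary representation is:

$$(0.49039)_{10} = (.011111011000)_2 \tag{12}$$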
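The 12-bit expansion in Eq. 12 can be reproduced by repeated doubling. The sketch below assumes truncation (not rounding) to 12 fraction bits; the function names are illustrative.

```python
def fraction_to_binary(value, bits=12):
    """Truncated binary expansion of a fraction in [0, 1) as a string of bits."""
    digits = []
    for _ in range(bits):
        value *= 2
        bit = int(value)        # the integer part is the next bit
        digits.append(str(bit))
        value -= bit
    return "".join(digits)

def binary_to_fraction(digits):
    """Value of a binary fraction string: digit i weighs 2^-(i+1)."""
    return sum(int(d) * 2.0 ** -(i + 1) for i, d in enumerate(digits))

bits_c1 = fraction_to_binary(0.49039)   # "011111011000"
approx_c1 = binary_to_fraction(bits_c1) # 0.490234375
```

The truncated value 0.490234375 differs from 0.49039 by about 0.00016, which is below one LSB (2^-12).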



Fig. 1 Adder matrix for intermediate calculations 


The above equation can also be represented as:

So, the equation for Y(1) in vector form can be represented as
shown below:


Fig. 1 represents the adder matrix for the intermediate values
required for Y(0), Y(1), Y(2), Y(3), Y(4), Y(5), Y(6), Y(7),
where


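$$
\begin{aligned}
m_1 &= x_0 + x_7, & m_2 &= x_1 + x_6 \\
m_3 &= x_2 + x_5, & m_4 &= x_3 + x_4 \\
m_5 &= x_0 - x_7, & m_6 &= x_1 - x_6 \\
m_7 &= x_2 - x_5, & m_8 &= x_3 - x_4 \\
c_1 &= m_1 + m_4, & c_2 &= m_2 + m_3 \\
c_3 &= m_1 - m_4, & c_4 &= m_2 - m_3 \\
e_1 &= c_1 + c_2, & e_2 &= c_1 - c_2
\end{aligned} \tag{15}
$$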
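The shared intermediate values of the adder matrix can be computed once and reused by all eight outputs. The following sketch assumes the names `e1` and `e2` for the final sum and difference (the original labels are not legible in the source) and uses the Table I input as an example.

```python
def adder_matrix(x):
    """Intermediate sums and differences shared by the outputs Y(0)..Y(7)."""
    v = {}
    # Stage 1: eight adders/subtractors on mirrored input pairs.
    v["m1"], v["m2"], v["m3"], v["m4"] = x[0] + x[7], x[1] + x[6], x[2] + x[5], x[3] + x[4]
    v["m5"], v["m6"], v["m7"], v["m8"] = x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]
    # Stage 2: four adders/subtractors, used only by the even outputs.
    v["c1"], v["c2"] = v["m1"] + v["m4"], v["m2"] + v["m3"]
    v["c3"], v["c4"] = v["m1"] - v["m4"], v["m2"] - v["m3"]
    # Stage 3: two adders/subtractors (names e1, e2 are assumed) for Y(0) and Y(4).
    v["e1"], v["e2"] = v["c1"] + v["c2"], v["c1"] - v["c2"]
    return v

values = adder_matrix([60, 40, 25, 55, 40, 42, 82, 84])
```

In terms of Eq. 6, Y(0) and Y(4) depend only on `e1` and `e2`, Y(2) and Y(6) only on `c3` and `c4`, and the odd outputs only on `m5` to `m8`, so 14 adders/subtractors cover all the input combinations.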


In a similar way, the 1-D DCT coefficient Y(5) can be expressed
as


III. PROPOSED METHOD FOR SHIFTING AND ADDING

In the proposed method, the shifting and adding operations are
performed separately for the integer and decimal parts using
fixed-point calculation. The input pixel values are 8 bits wide
with the MSB representing the sign bit, so the maximum bit length
in the adder matrix (as can be seen from Fig. 1) is 12 bits
(m1 + c1 + c3 + sign bit). Every individual vector
$Y^{0}(1), Y^{1}(1), \ldots, Y^{11}(1), Y^{12}(1)$ is therefore of
13 bits, with the MSB representing the sign bit before the
decimal point.


As is evident from Eq. (16), the first value is not shifted, as
it is multiplied by $-2^{0}$, and twelve zeros are padded at the
LSB to represent the absence of a fraction part. The next vector
is multiplied by $2^{-1}$, so it is shifted right by 1 bit (for
shifting, the MSB is padded with the MSB of $Y^{1}(1)$, which is
either 0 or 1 depending on whether the result is positive or
negative), and the LSB is padded with eleven zeros, as one LSB of
$Y^{1}(1)$ is shifted into the fraction part. Similarly, the next
vector is multiplied by $2^{-2}$: its two MSB positions are padded
with the MSB of $Y^{2}(1)$, and ten zeros are padded at the LSB as
discussed above. Each vector is thus shifted by one more bit than
the previous one and added to it, so the last value is shifted
entirely into the fraction part. This can be understood from the
following assignment example:

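$$
\begin{aligned}
Z ={} & \{\, Y^{0}(1)[11{:}0],\ 000000000000 \,\} \; + \\
      & \{\, Y^{1}(1)[11],\ Y^{1}(1)[11{:}0],\ 00000000000 \,\} \; + \\
      & \{\, Y^{2}(1)[11],\ Y^{2}(1)[11],\ Y^{2}(1)[11{:}0],\ 0000000000 \,\} \; + \\
      & \cdots \; + \\
      & \{\, Y^{12}(1)[11],\ Y^{12}(1)[11],\ \ldots,\ Y^{12}(1)[11],\ Y^{12}(1)[11{:}0] \,\}
\end{aligned}
$$

In each 24-bit word, the upper 12 bits hold the integer values
and the lower 12 bits hold the decimal values.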
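The concatenation with sign extension and zero padding can be modelled with integers. This is a sketch under assumptions, not the authors' implementation: 12-bit two's complement partial sums, 12 fraction bits, a 24-bit accumulator word, and coefficients truncated to 12 fraction bits; all function names are hypothetical.

```python
import math

WORD = 12  # assumed width of each partial sum Y^i(1), two's complement
FRAC = 12  # assumed number of fraction bits in the result

def coefficient_bits(value):
    """Bits b_0..b_FRAC of a value in [-1, 1); b_0 weighs -2^0, b_i weighs 2^-i."""
    word = math.floor(value * (1 << FRAC)) & ((1 << (FRAC + 1)) - 1)
    return [(word >> (FRAC - i)) & 1 for i in range(FRAC + 1)]

def partial_sums(coeffs, xs):
    """Y^i(1): sum of the inputs whose coefficient has a 1 in bit plane i."""
    rows = [coefficient_bits(c) for c in coeffs]
    return [sum(x for bits, x in zip(rows, xs) if bits[i]) for i in range(FRAC + 1)]

def place(y, shift):
    """Build {shift sign copies, y[11:0], (FRAC - shift) zeros} as a signed 24-bit value."""
    low = y & ((1 << WORD) - 1)
    word = low << (FRAC - shift)
    if (low >> (WORD - 1)) & 1:
        # Negative partial sum: fill the vacated MSB positions with ones.
        word |= ((1 << shift) - 1) << (WORD + FRAC - shift)
    total_bits = WORD + FRAC
    return word - (1 << total_bits) if word >> (total_bits - 1) else word

def shift_add(partials):
    """Accumulate the placed words; the first (sign-plane) term is subtracted."""
    acc = -place(partials[0], 0)
    for i in range(1, FRAC + 1):
        acc += place(partials[i], i)
    return acc / (1 << FRAC)  # interpret the lower FRAC bits as the fraction

COEFFS = [0.5 * math.cos(k * math.pi / 16) for k in (1, 3, 5, 7)]
DIFFS = [-24, -42, -17, 15]  # x(0)-x(7), x(1)-x(6), x(2)-x(5), x(3)-x(4) for Table I
y1 = shift_add(partial_sums(COEFFS, DIFFS))
```

Because no bits are discarded during the shifts, `y1` equals the inner product with the truncated constants exactly; the only error left in this model is the coefficient quantization.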

With this shifting and adding operation, more precise results are
obtained. Table I compares the results with those of existing
methods.

Table I. Comparison of the proposed method's results with previously
reported architectures.

Input pixel   Input pixel   1-D DCT coefficient   Value from   Value from   Value from
variable      value         (MATLAB)              [8]          [5]          proposed method
x(0)          60            151.3209              149          149          151.3046
x(1)          40            -32.4895              -36          -32          -33.5244
x(2)          25            33.1588               31           31           33.1611
x(3)          55            -1.7108               -5           -2           -2.2888
x(4)          40            17.6777               16           16           17.6757
x(5)          42            18.5074               14           18           18.5046
x(6)          82            -16.0309              -18          -18          -17.9736
x(7)          84            -5.0975               -11          -9           -5.9050
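The comparison in Table I can be summarized by the absolute error of each method against the MATLAB column. The sketch below only restates the table values; the summary statistics (maximum and mean absolute error) are an illustrative choice, not a metric used in the original work.

```python
# Values copied from Table I.
REFERENCE = [151.3209, -32.4895, 33.1588, -1.7108, 17.6777, 18.5074, -16.0309, -5.0975]
COLUMNS = {
    "[8]": [149, -36, 31, -5, 16, 14, -18, -11],
    "[5]": [149, -32, 31, -2, 16, 18, -18, -9],
    "proposed": [151.3046, -33.5244, 33.1611, -2.2888, 17.6757, 18.5046, -17.9736, -5.9050],
}

def error_summary(values, reference=REFERENCE):
    """Return (maximum, mean) absolute error against the reference column."""
    errors = [abs(v - r) for v, r in zip(values, reference)]
    return max(errors), sum(errors) / len(errors)

summary = {name: error_summary(vals) for name, vals in COLUMNS.items()}
```

For these eight coefficients the maximum absolute error is about 5.90 for [8], 3.90 for [5] and 1.94 for the proposed method.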


The design has been simulated in Xilinx 13.1 and implemented on a
Virtex-5 (XC5VLX110T) FPGA. The results obtained are more precise
than those of previously reported architectures. The total delay
of the proposed module is 17.903 ns and the device utilization is
only 2% (1635 of 69120 slice LUTs).

IV. CONCLUSION


The proposed design is a ROM-free multiplication technique built
from adders and shifters. DA has been derived mathematically for
the 1-D DCT module, and an explanation for different input
vectors has been given in this paper. By analyzing the equations
for the 1-D DCT outputs Y(0), Y(1), Y(2), Y(3), Y(4), Y(5), Y(6),
Y(7), it is concluded that many adders/subtractors can be shared
in order to minimize the hardware, by using a compressed
adder/subtractor matrix which can be designed according to the
equations discussed in the paper. The delay of the proposed work
is 17 ns and the overall design uses only 2% of the available
resources. The proposed design achieves efficient hardware
utilization and saves chip area.